Contextual Entity Resolution Approach for Genealogical Data

Authors

  • Hossein Rahmani
  • Bijan Ranjbar Sahraei
  • Gerhard Weiss
  • Karl Tuyls
Abstract

Because the available digitized genealogical data contain a large amount of inaccurate information and several types of ambiguity, Entity Resolution techniques, which determine the records that refer to the same entity, should be applied as the first, and still very important, step in analyzing this type of data. Traditional methods use a standard string similarity measure to calculate the similarity among references, neglecting the contextual information available for each reference, and then report the most similar pairs as matches. In this paper, we first introduce a novel blocking strategy to reduce the number of potential candidate pairs. Second, we propose a contextual similarity measure that considers not only the string similarity among references but also the contextual information available for them. Third, we evaluate the proposed method extensively from different perspectives; among the many patterns discussed, the “early child death” pattern is found to be prominent.
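The abstract does not say which blocking key, string measure, or context fields the authors use, so the following is only a minimal sketch of the general idea: block first to limit candidate pairs, then score each pair with a mix of string and contextual similarity. The blocking key (first letter of the surname), the `spouse` context field, the equal weighting, the use of `difflib`, and the sample records are all assumptions made for illustration, not the paper's method.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def string_sim(a, b):
    """Plain string similarity in [0, 1] (difflib ratio, an illustrative choice)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def block_records(records):
    """Group records by a hypothetical blocking key: first letter of the surname."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[rec["surname"][:1].lower()].append(rec)
    return blocks


def candidate_pairs(records):
    """Yield only pairs that share a block, instead of all possible pairs."""
    for block in block_records(records).values():
        yield from combinations(block, 2)


def contextual_sim(r1, r2, weight=0.5):
    """Mix name similarity with similarity of a context field (assumed: spouse name)."""
    name = string_sim(
        r1["name"] + " " + r1["surname"], r2["name"] + " " + r2["surname"]
    )
    context = string_sim(r1["spouse"], r2["spouse"])
    return weight * name + (1 - weight) * context


# Made-up records for illustration only.
records = [
    {"id": 1, "name": "Johannes", "surname": "Jansen", "spouse": "Maria de Vries"},
    {"id": 2, "name": "Johanes", "surname": "Janssen", "spouse": "Maria de Vries"},
    {"id": 3, "name": "Johannes", "surname": "Jansen", "spouse": "Anna Bakker"},
    {"id": 4, "name": "Pieter", "surname": "Smit", "spouse": "Maria de Vries"},
]

# Score every candidate pair and rank the most similar pairs first.
scored = sorted(
    ((contextual_sim(a, b), a["id"], b["id"]) for a, b in candidate_pairs(records)),
    reverse=True,
)
```

In this toy data, records 1 and 3 have identical names but different spouses, while records 1 and 2 differ slightly in spelling but share a spouse; a purely string-based measure would prefer the former pair, whereas the contextual score ranks the latter higher.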


Similar resources

Entity resolution in disjoint graphs: An application on genealogical data

Entity Resolution (ER) is the process of identifying references referring to the same entity from one or more data sources. In the ER process, most existing approaches exploit the content information of references, categorized as content-based ER, or additionally consider linkage information among references, categorized as context-based ER. However, in new applications of ER, such as in the gen...


The Effect of Transitive Closure on the Calibration of Logistic Regression for Entity Resolution

This paper describes a series of experiments in using logistic regression machine learning as a method for entity resolution. From these experiments the authors concluded that when a supervised ML algorithm is trained to classify a pair of entity references as linked or not linked pair, the evaluation of the model’s performance should take into account the transitive closure of its pairwise lin...


Entity Type Recognition for Heterogeneous Semantic Graphs

We describe an approach to reducing the computational cost of identifying coreferent instances in heterogeneous semantic graphs where the underlying ontologies may not be informative or even known. The problem is similar to coreference resolution in unstructured text, where a variety of linguistic clues and contextual information is used to infer entity types and predict coreference. Semantic g...


Improvement of Chemical Named Entity Recognition through Sentence-based Random Under-sampling and Classifier Combination

Chemical Named Entity Recognition (NER) is the basic step for subsequent information extraction tasks such as named entity resolution, drug-drug interaction discovery, and extraction of the names of molecules and their properties. Improvement in the performance of such systems may affect the quality of the subsequent tasks. Chemical text from which data for named entity recognition is extracte...


Corpus based coreference resolution for Farsi text

"Coreference resolution", or "finding all expressions that refer to the same entity" in a text, is one of the important requirements in natural language processing. Two words are coreferent when both refer to a single entity in the text or the real world. So the main task of coreference resolution systems is to identify terms that refer to a unique entity. A coreference resolution tool could be...




Publication date: 2014